Whats Covered

Aditional Resources


Assembling data


Binds

  • Base R binds
    • rbind, cbind
  • dplyr binds
    • bind_rows, bind_cols

 

  • Advantages of dplyr versions
    • faster
    • return a tibble (advanced data frame)
    • can handle lists of dataframes
    • .id argument in bind_rows allows you to pass in a name for each source dataframe
    • this will be added as a column to the result dataset
    • this is really handy when one of the variables for the data set is stored in the table name.
    • e.g. houly weather data for different days
    • rbind will return an error if the column names do not match exactly.
    • bind_rows will create a column for each unique column and distribute NAs appropriately

Which bind?

side_one and side_two contain tracks from sides one and two, respectively, of Pink Floyd’s famous album The Dark Side of the Moon.

Bind these datasets into a single table using a dplyr function. Which type of bind should you use?

# Examine side_one and side_two
side_one  %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
song length
Speak to Me 01:30:00
Breathe 02:43:00
On the Run 03:30:00
Time 06:53:00
The Great Gig in the Sky 04:15:00
side_two  %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
song length
Money 06:30:00
Us and Them 07:51:00
Any Colour You Like 03:24:00
Brain Damage 03:50:00
Eclipse 02:03:00
# Bind side_one and side_two into a single dataset
side_one %>% 
  bind_rows(side_two) %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
song length
Speak to Me 01:30:00
Breathe 02:43:00
On the Run 03:30:00
Time 06:53:00
The Great Gig in the Sky 04:15:00
Money 06:30:00
Us and Them 07:51:00
Any Colour You Like 03:24:00
Brain Damage 03:50:00
Eclipse 02:03:00

Bind rows

discography and jimi contain all of the information you need to create an anthology dataset for the band The Jimi Hendrix Experience.

discography contains a data frame of each album by The Jimi Hendrix Experience and the year of the album.

jimi contains a list of data frames of album tracks, one for each album released by The Jimi Hendrix Experience. As explained, you can pass bind_rows() a list of data frames like jimi to bind together into a single data frame.

# Examine discography and jimi
discography
## # A tibble: 3 x 2
##   album                year
##   <chr>               <int>
## 1 Are You Experienced  1967
## 2 Axis: Bold as Love   1967
## 3 Electric Ladyland    1968
jimi
## $`Are You Experienced`
## # A tibble: 10 x 2
##    song                     length
##    <chr>                    <time>
##  1 Purple Haze              02:46 
##  2 Manic Depression         03:46 
##  3 Hey Joe                  03:23 
##  4 May This Be Love         03:14 
##  5 I Don't Live Today       03:55 
##  6 The Wind Cries Mary      03:21 
##  7 Fire                     02:34 
##  8 Third Stone from the Sun 06:40 
##  9 Foxy Lady                03:15 
## 10 Are You Experienced?     03:55 
## 
## $`Axis: Bold As Love`
## # A tibble: 13 x 2
##    song                 length
##    <chr>                <time>
##  1 EXP                  01:55 
##  2 Up from the Skies    02:55 
##  3 Spanish Castle Magic 03:00 
##  4 Wait Until Tomorrow  03:00 
##  5 Ain't No Telling     01:46 
##  6 Little Wing          02:24 
##  7 If 6 was 9           05:32 
##  8 You Got Me Floatin   02:45 
##  9 Castles Made of Sand 02:46 
## 10 She's So Fine        02:37 
## 11 One Rainy Wish       03:40 
## 12 Little Miss Lover    02:20 
## 13 Bold as Love         04:11 
## 
## $`Electric Ladyland`
## # A tibble: 16 x 2
##    song                                       length
##    <chr>                                      <time>
##  1 And the Gods Made Love                     01:21 
##  2 Have You Ever Been (To Electric Ladyland)  02:11 
##  3 Crosstown Traffic                          02:25 
##  4 Voodoo Chile                               15:00 
##  5 Little Miss Strange                        02:52 
##  6 Long Hot Summer Night                      03:27 
##  7 Come On (Part 1)                           04:09 
##  8 Gypsy Eyes                                 03:43 
##  9 Burning of the Midnight Lamp               03:39 
## 10 Rainy Day, Dream Away                      03:42 
## 11 1983... (A Merman I Should Turn to Be)     13:39 
## 12 Moon, Turn the Tides... Gently Gently Away 01:02 
## 13 Still Raining, Still Dreaming              04:25 
## 14 House Burning Down                         04:33 
## 15 All Along the Watchtower                   04:01 
## 16 Voodoo Child (Slight Return)               05:12
jimi %>% 
  # Bind jimi by rows, with ID column "album"
  bind_rows(.id = c("album")) %>% 
  # Make a complete data frame
  left_join(discography, by = c("album")) %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
album song length year
Are You Experienced Purple Haze 02:46:00 1967
Are You Experienced Manic Depression 03:46:00 1967
Are You Experienced Hey Joe 03:23:00 1967
Are You Experienced May This Be Love 03:14:00 1967
Are You Experienced I Don’t Live Today 03:55:00 1967
Are You Experienced The Wind Cries Mary 03:21:00 1967
Are You Experienced Fire 02:34:00 1967
Are You Experienced Third Stone from the Sun 06:40:00 1967
Are You Experienced Foxy Lady 03:15:00 1967
Are You Experienced Are You Experienced? 03:55:00 1967
Axis: Bold As Love EXP 01:55:00 NA
Axis: Bold As Love Up from the Skies 02:55:00 NA
Axis: Bold As Love Spanish Castle Magic 03:00:00 NA
Axis: Bold As Love Wait Until Tomorrow 03:00:00 NA
Axis: Bold As Love Ain’t No Telling 01:46:00 NA
Axis: Bold As Love Little Wing 02:24:00 NA
Axis: Bold As Love If 6 was 9 05:32:00 NA
Axis: Bold As Love You Got Me Floatin 02:45:00 NA
Axis: Bold As Love Castles Made of Sand 02:46:00 NA
Axis: Bold As Love She’s So Fine 02:37:00 NA
Axis: Bold As Love One Rainy Wish 03:40:00 NA
Axis: Bold As Love Little Miss Lover 02:20:00 NA
Axis: Bold As Love Bold as Love 04:11:00 NA
Electric Ladyland And the Gods Made Love 01:21:00 1968
Electric Ladyland Have You Ever Been (To Electric Ladyland) 02:11:00 1968
Electric Ladyland Crosstown Traffic 02:25:00 1968
Electric Ladyland Voodoo Chile 15:00:00 1968
Electric Ladyland Little Miss Strange 02:52:00 1968
Electric Ladyland Long Hot Summer Night 03:27:00 1968
Electric Ladyland Come On (Part 1) 04:09:00 1968
Electric Ladyland Gypsy Eyes 03:43:00 1968
Electric Ladyland Burning of the Midnight Lamp 03:39:00 1968
Electric Ladyland Rainy Day, Dream Away 03:42:00 1968
Electric Ladyland 1983… (A Merman I Should Turn to Be) 13:39:00 1968
Electric Ladyland Moon, Turn the Tides… Gently Gently Away 01:02:00 1968
Electric Ladyland Still Raining, Still Dreaming 04:25:00 1968
Electric Ladyland House Burning Down 04:33:00 1968
Electric Ladyland All Along the Watchtower 04:01:00 1968
Electric Ladyland Voodoo Child (Slight Return) 05:12:00 1968

Bind columns

Let’s make a compilation of Hank Williams’ 67 singles. To do this, you can use hank_years and hank_charts:

hank_years contains the name and release year of each of Hank Williams’ 67 singles.
hank_charts contains the name of each of Hank Williams’ 67 singles as well as the highest position it earned on the Billboard sales charts.

Each dataset contains the same songs, but hank_years is arranged chronologically by year, while hank_charts is arranged alphabetically by song title.

# Examine hank_years and hank_charts
hank_years %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
year song
1947 Move It On Over
1947 My Love for You (Has Turned to Hate)
1947 Never Again (Will I Knock on Your Door)
1947 On the Banks of the Old Ponchartrain
1947 Pan American
1947 Wealth Won’t Save Your Soul
1948 A Mansion on the Hill
1948 Honky Tonkin’
1948 I Saw the Light
1948 I’m a Long Gone Daddy
1948 My Sweet Love Ain’t Around
1949 I’m So Lonesome I Could Cry
1949 Lost Highway
1949 Lovesick Blues
1949 Mind Your Own Business
1949 My Bucket’s Got a Hole in It
1949 Never Again (Will I Knock on Your Door)
1949 Wedding Bells
1949 You’re Gonna Change (Or I’m Gonna Leave)
1950 I Just Don’t Like This Kind of Living
1950 Long Gone Lonesome Blues
1950 Moanin’ the Blues
1950 My Son Calls Another Man Daddy
1950 Nobody’s Lonesome for Me
1950 They’ll Never Take Her Love from Me
1950 Why Don’t You Love Me
1950 Why Should We Try Anymore
1951 (I Heard That) Lonesome Whistle
1951 Baby, We’re Really in Love
1951 Cold, Cold Heart
1951 Crazy Heart
1951 Dear John
1951 Hey Good Lookin’
1951 Howlin’ At the Moon
1951 I Can’t Help It (If I’m Still in Love With You)
1952 Half as Much
1952 Honky Tonk Blues
1952 I’ll Never Get Out of This World Alive
1952 Jambalaya (On the Bayou)
1952 Settin’ the Woods on Fire
1952 You Win Again
1953 Calling You
1953 I Won’t Be Home No More
1953 Kaw-Liga
1953 Take These Chains from My Heart
1953 Weary Blues from Waitin’
1953 Your Cheatin’ Heart
1954 (I’m Gonna) Sing, Sing, Sing
1954 How Can You Refuse Him Now
1954 I’m Satisfied with You
1954 You Better Keep It on Your Mind
1955 A Teardrop on a Rose
1955 At the First Fall of Snow
1955 Mother Is Gone
1955 Please Don’t Let Me Love You
1955 Thank God
1956 A Home in Heaven
1956 California Zephyr
1956 Singing Waterfall
1956 There’s No Room in My Heart for the Blues
1957 Leave Me Alone with the Blues
1957 Ready to Go Home
1957 The Waltz of the Wind
1958 Just Waitin’
1965 The Pale Horse and His Rider
1966 Kaw-Liga
1989 There’s a Tear in My Beer
hank_charts %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
song peak
(I Heard That) Lonesome Whistle 9
(I’m Gonna) Sing, Sing, Sing NA
A Home in Heaven NA
A Mansion on the Hill 12
A Teardrop on a Rose NA
At the First Fall of Snow NA
Baby, We’re Really in Love 4
California Zephyr NA
Calling You NA
Cold, Cold Heart 1
Crazy Heart 4
Dear John 8
Half as Much 2
Hey Good Lookin’ 1
Honky Tonk Blues 2
Honky Tonkin’ 14
How Can You Refuse Him Now NA
Howlin’ At the Moon 3
I Can’t Help It (If I’m Still in Love With You) 2
I Just Don’t Like This Kind of Living 5
I Saw the Light NA
I Won’t Be Home No More 4
I’ll Never Get Out of This World Alive 1
I’m a Long Gone Daddy 6
I’m Satisfied with You NA
I’m So Lonesome I Could Cry 2
Jambalaya (On the Bayou) 1
Just Waitin’ NA
Kaw-Liga 1
Kaw-Liga NA
Leave Me Alone with the Blues NA
Long Gone Lonesome Blues 1
Lost Highway 12
Lovesick Blues 1
Mind Your Own Business 5
Moanin’ the Blues 1
Mother Is Gone NA
Move It On Over 4
My Bucket’s Got a Hole in It 2
My Love for You (Has Turned to Hate) NA
My Son Calls Another Man Daddy 9
My Sweet Love Ain’t Around NA
Never Again (Will I Knock on Your Door) NA
Never Again (Will I Knock on Your Door) 6
Nobody’s Lonesome for Me 9
On the Banks of the Old Ponchartrain NA
Pan American NA
Please Don’t Let Me Love You 9
Ready to Go Home NA
Settin’ the Woods on Fire 2
Singing Waterfall NA
Take These Chains from My Heart 1
Thank God NA
The Pale Horse and His Rider NA
The Waltz of the Wind NA
There’s a Tear in My Beer 7
There’s No Room in My Heart for the Blues NA
They’ll Never Take Her Love from Me 5
Wealth Won’t Save Your Soul NA
Weary Blues from Waitin’ 7
Wedding Bells 2
Why Don’t You Love Me 1
Why Should We Try Anymore 9
You Better Keep It on Your Mind NA
You Win Again 10
You’re Gonna Change (Or I’m Gonna Leave) 4
Your Cheatin’ Heart 1
hank_years %>% 
  # Reorder hank_years alphabetically by song title
  arrange(song) %>% 
  # Select just the year column
  select(year) %>% 
  # Bind the year column
  bind_cols(hank_charts) %>% 
  # Arrange the finished dataset
  arrange(year, song) %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
year song peak
1947 Move It On Over 4
1947 My Love for You (Has Turned to Hate) NA
1947 Never Again (Will I Knock on Your Door) NA
1947 On the Banks of the Old Ponchartrain NA
1947 Pan American NA
1947 Wealth Won’t Save Your Soul NA
1948 A Mansion on the Hill 12
1948 Honky Tonkin’ 14
1948 I’m Satisfied with You NA
1948 I Just Don’t Like This Kind of Living 5
1948 My Sweet Love Ain’t Around NA
1949 I Won’t Be Home No More 4
1949 Lost Highway 12
1949 Lovesick Blues 1
1949 Mind Your Own Business 5
1949 My Bucket’s Got a Hole in It 2
1949 Never Again (Will I Knock on Your Door) 6
1949 Wedding Bells 2
1949 You Better Keep It on Your Mind NA
1950 I’m a Long Gone Daddy 6
1950 Long Gone Lonesome Blues 1
1950 Moanin’ the Blues 1
1950 My Son Calls Another Man Daddy 9
1950 Nobody’s Lonesome for Me 9
1950 They’ll Never Take Her Love from Me 5
1950 Why Don’t You Love Me 1
1950 Why Should We Try Anymore 9
1951 (I’m Gonna) Sing, Sing, Sing NA
1951 Baby, We’re Really in Love 4
1951 Cold, Cold Heart 1
1951 Crazy Heart 4
1951 Dear John 8
1951 Hey Good Lookin’ 1
1951 Howlin’ At the Moon 3
1951 I’ll Never Get Out of This World Alive 1
1952 Half as Much 2
1952 Honky Tonk Blues 2
1952 I Can’t Help It (If I’m Still in Love With You) 2
1952 Jambalaya (On the Bayou) 1
1952 Settin’ the Woods on Fire 2
1952 You’re Gonna Change (Or I’m Gonna Leave) 4
1953 Calling You NA
1953 I’m So Lonesome I Could Cry 2
1953 Kaw-Liga 1
1953 Take These Chains from My Heart 1
1953 Weary Blues from Waitin’ 7
1953 Your Cheatin’ Heart 1
1954 (I Heard That) Lonesome Whistle 9
1954 How Can You Refuse Him Now NA
1954 I Saw the Light NA
1954 You Win Again 10
1955 A Teardrop on a Rose NA
1955 At the First Fall of Snow NA
1955 Mother Is Gone NA
1955 Please Don’t Let Me Love You 9
1955 Thank God NA
1956 A Home in Heaven NA
1956 California Zephyr NA
1956 Singing Waterfall NA
1956 There’s No Room in My Heart for the Blues NA
1957 Leave Me Alone with the Blues NA
1957 Ready to Go Home NA
1957 The Waltz of the Wind NA
1958 Just Waitin’ NA
1965 The Pale Horse and His Rider NA
1966 Kaw-Liga NA
1989 There’s a Tear in My Beer 7

Danger

  • The biggest risk when using bind_cols is that the observations may not align acros the rows of your data frame
  • Indeed, you will create meaningless data if you column bind two datasets whose rows do not align. Unfortunately, there is usually no clear way to tell whether or not the rows in two datasets align unless the datasets contain a mutual key. In that scenario, you can use a mutating join to bind the datasets in a foolproof way.

Build a better data frame

  • R Base way to create a dataframe
    • data.frame() and as.data.frame()
  • dplyr way to create a dataframe
    • data_frame() and as_data_frame()
  • Advantages of using data_frame():
    • will never change the data type of a vector. (e.g. strings to factors)
    • will never add row names
    • will not changes unusual column names
    • only recycles length 1 inputs
    • evaluate lazily and in order. (So you can reference a column in the next column definition)
    • outputs a tibble (class tbl_df)
  • as_data_frame does all these same things
    • its just used on a list of columns

Make a data frame

Let’s make a Greatest Hits compilation for Hank Williams. hank_year, hank_song, and hank_peak contain the columns of the data frame you made in the last exercise.

# Make combined data frame using data_frame()
data_frame(year = hank_year, song = hank_song, peak = hank_peak) %>% 
  # Extract songs where peak equals 1
  filter(peak == 1) %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
year song peak
1949 Lovesick Blues 1
1950 Long Gone Lonesome Blues 1
1950 Moanin’ the Blues 1
1950 Why Don’t You Love Me 1
1951 Cold, Cold Heart 1
1951 Hey Good Lookin’ 1
1952 I’ll Never Get Out of This World Alive 1
1952 Jambalaya (On the Bayou) 1
1953 Kaw-Liga 1
1953 Take These Chains from My Heart 1
1953 Your Cheatin’ Heart 1

Lists of columns

As a data scientist, you should always be prepared to handle raw data that comes in many different formats.

hank saves Hank Williams’ singles in a different way, as a list of vectors. Can you turn hank into the same dataset that you made in the last exercise?

# Examine the contents of hank
str(hank)
## List of 3
##  $ year: int [1:67] 1947 1947 1947 1947 1947 1947 1948 1948 1948 1948 ...
##  $ song: chr [1:67] "Move It On Over" "My Love for You (Has Turned to Hate)" "Never Again (Will I Knock on Your Door)" "On the Banks of the Old Ponchartrain" ...
##  $ peak: int [1:67] 4 NA NA NA NA NA 12 14 NA 2 ...
# Convert the hank list into a data frame
as_data_frame(hank) %>% 
  # Extract songs where peak equals 1
  filter(peak == 1) %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
year song peak
1949 Lovesick Blues 1
1950 Long Gone Lonesome Blues 1
1950 Moanin’ the Blues 1
1950 Why Don’t You Love Me 1
1951 Cold, Cold Heart 1
1951 Hey Good Lookin’ 1
1952 I’ll Never Get Out of This World Alive 1
1952 Jambalaya (On the Bayou) 1
1953 Kaw-Liga 1
1953 Take These Chains from My Heart 1
1953 Your Cheatin’ Heart 1

Lists of rows (data frames)

michael contains a list of data frames, one for each album released by Michael Jackson. The code in the editor attempts to bind the data frames into a single data frame and then extract a data frame of the top tracks on each album.

However, the code runs into a problem. The commented line fails because as_data_frame() combines a list of column vectors into a data frame. However, michael is a list of data frames.

# Examine the contents of michael
michael
## $`Got to Be There`
## # A tibble: 10 x 2
##    song                               peak
##    <chr>                             <int>
##  1 Ain'T No Sunshine                    NA
##  2 I Wanna be Where You Are             NA
##  3 Girl Don't Take Your Love from Me    NA
##  4 In Our Small Way                     NA
##  5 Got to Be There                       4
##  6 Rockin' Robin                         2
##  7 Wings of My Love                     NA
##  8 Maria (You Were the Only One)        NA
##  9 Love is Here and Now You're Gone     NA
## 10 You've Got a Friend                  NA
## 
## $Ben
## # A tibble: 10 x 2
##    song                             peak
##    <chr>                           <int>
##  1 Ben                                 1
##  2 Greatest Show On Earth             NA
##  3 People Make the World Go 'Round    NA
##  4 We've Got a Good Thing             NA
##  5 Everybody's Somebody's Fool        NA
##  6 My Girl                            NA
##  7 What Goes Around Comes Around      NA
##  8 In Our Small Way                   NA
##  9 Shoo-Be-Doo-Be-Doo-Da-Day          NA
## 10 You Can Cry On My Shoulder         NA
## 
## $`Music & Me`
## # A tibble: 10 x 2
##    song                    peak
##    <chr>                  <int>
##  1 With a Child's Heart      50
##  2 Up Again                  NA
##  3 All the Things You Are    NA
##  4 Happy                     NA
##  5 Too Young                 NA
##  6 Doggin' Around            NA
##  7 Euphoria                  NA
##  8 Morning Glow              NA
##  9 Johnny Raven              NA
## 10 Music and Me              NA
## 
## $`Forever, Michael`
## # A tibble: 10 x 2
##    song                      peak
##    <chr>                    <int>
##  1 We're Almost There          54
##  2 Take Me Back                NA
##  3 One Day in Your Life        NA
##  4 Cinderella Stay Awhile      NA
##  5 We've Got Forever           NA
##  6 Just a Little Bit of You    23
##  7 You Are There               NA
##  8 Dapper-Dan                  NA
##  9 Dear Michael                NA
## 10 I'll Come Home to You       NA
## 
## $`Off the Wall`
## # A tibble: 10 x 2
##    song                            peak
##    <chr>                          <int>
##  1 Don't Stop 'Til You Get Enough     1
##  2 Rock with You                      1
##  3 Working Day and Night             NA
##  4 Get on the Floor                  NA
##  5 Off the Wall                      10
##  6 Girlfriend                        NA
##  7 She's Out of My Life              10
##  8 I Can't Help It                   NA
##  9 It's the Falling in Love          NA
## 10 Burn This Disco Out               NA
## 
## $Thriller
## # A tibble: 9 x 2
##   song                         peak
##   <chr>                       <int>
## 1 Wanna Be Startin' Somethin      5
## 2 Baby be Mine                   NA
## 3 The Girl is Mine                2
## 4 Thriller                        4
## 5 Beat It                         1
## 6 Billie Jean                     1
## 7 Human Nature                    7
## 8 P.Y.T. (Pretty Young Thing)    10
## 9 The Lady in My Life            NA
## 
## $Bad
## # A tibble: 10 x 2
##    song                          peak
##    <chr>                        <int>
##  1 Bad                              1
##  2 The Way You Make Me Feel         1
##  3 Speed Demon                     NA
##  4 Liberian Girl                   NA
##  5 Just Good Friends               NA
##  6 Another Part of Me              11
##  7 Man in the Mirror                1
##  8 I Just Can't Stop Loving You     1
##  9 Dirty Diana                      1
## 10 Smooth Criminal                  7
## 
## $Dangerous
## # A tibble: 14 x 2
##    song                      peak
##    <chr>                    <int>
##  1 Jam                         26
##  2 Why You Wanna Trip on Me    NA
##  3 In the Closet                6
##  4 She Drives Me Wild          NA
##  5 Remember the Time            3
##  6 Can't Let Her Get Away      NA
##  7 Heal the World              27
##  8 Black or White               1
##  9 Who Is It                   14
## 10 Give In to Me               NA
## 11 Will You Be There            7
## 12 Keep the Faith              NA
## 13 Gone Too Soon               NA
## 14 Dangerous                   NA
## 
## $HIStory
## # A tibble: 30 x 2
##    song                          peak
##    <chr>                        <int>
##  1 Billie Jean                     NA
##  2 The Way You Make Me Feel        NA
##  3 Black or White                  NA
##  4 Rock with You                   NA
##  5 She's Out of My Life            NA
##  6 Bad                             NA
##  7 I Just Can't Stop Loving You    NA
##  8 Man in the Mirror               NA
##  9 Thriller                        NA
## 10 Beat It                         NA
## # ... with 20 more rows
## 
## $Invincible
## # A tibble: 16 x 2
##    song               peak
##    <chr>             <int>
##  1 Unbreakable          NA
##  2 Heartbreaker         NA
##  3 Invincible           NA
##  4 Break of Dawn        NA
##  5 Heaven Can Wait      NA
##  6 You Rock My World    10
##  7 Butterflies          NA
##  8 Speechless           NA
##  9 2000 Watts           NA
## 10 You Are My Life      NA
## 11 Privacy              NA
## 12 Don't Walk Away      NA
## 13 Cry                  NA
## 14 The Lost Children    NA
## 15 Whatever Happens     NA
## 16 Threatened           NA
# Replace the first line so each album has its own rows
bind_rows(michael, .id = c("album")) %>% 
  group_by(album) %>% 
  mutate(rank = min_rank(peak)) %>% 
  filter(rank == 1) %>% 
  select(-rank, -peak) %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
album song
Got to Be There Rockin’ Robin
Ben Ben
Music & Me With a Child’s Heart
Forever, Michael Just a Little Bit of You
Off the Wall Don’t Stop ’Til You Get Enough
Off the Wall Rock with You
Thriller Beat It
Thriller Billie Jean
Bad Bad
Bad The Way You Make Me Feel
Bad Man in the Mirror
Bad I Just Can’t Stop Loving You
Bad Dirty Diana
Dangerous Black or White
HIStory You Are Not Alone
Invincible You Rock My World

Working with data types

Be aware of data types when combining vectors or data frames
- R will need to make a decision about the resulting type if the original types differ for a vector/column

<div style="width:300px; height=200px">
![](../assets/images/diff_data_types.png)
</div>  
  • 6 atomic data types in R
    • logical
    • character
    • double
    • integer
    • complex
    • raw
typeof(TRUE)
## [1] "logical"
typeof("hello")
## [1] "character"
typeof(3.14)
## [1] "double"
typeof(1L)
## [1] "integer"
typeof(1 + 2i)
## [1] "complex"
typeof(raw(1))
## [1] "raw"

Atomic data types

  • R will coerce logicals and numerics to strings if strings are present
  • R will coerce logicals to numerics if numerics are present
  • as.character() will coerce factors to their labels as strings if strings are present
  • as.numeric() will coerce factors to their levels as numerics if numerics are present

dplyr’s coercion rules

Base R coercion rules:

  • in general the more specific type of data will convert to the more general
    • integer and logical will go to integer
    • double and integer/locgical will go to double
    • character and integer/double/logical will go to character
  • factors are tricky
    • if joining to an integer or double the factor integer values will be used
    • if joining to a character the factor labels will be used
    • be careful if you have numeric labels for your factors

dplyr’s coercion rules:

  • dplyr functions will not automatically coerce data types
    • returns and error
    • expects you to manyally coerce data
  • Exception: factors
    • dplyr coverts non-aligning factors to strings
    • gives warning message

dplyr and coercion

  • dplyr’s two table functions will
    • coerce two factors to a string if the factors have different levels
    • coerce two factors to a string if the factors have the same levels in a different order
    • return an error instead of coercing logicals, integers, and numerics to a character

Determining type

  • eighties has the year variable stores as a factor
  • sevnties has it stored as a character
  • when combined it will be a character
    • we will get a warning message
seventies %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
year album band
1970 Bridge Over Troubled Water Simon and Garfunkel
1971 Jesus Christ Superstar Various Artists
1972 Harvest Neil Young
1973 The World is a Ghetto War
1974 Goodbye Yellow Brick Road Elton John
1975 Elton John’s Greatest Hits Elton John
1976 Peter Frampton Frampton Comes Alive
1977 Rumours Fleetwood Mac
1978 Saturday Night Fever Bee Gees
1979 Billy Joel 52nd Street
eighties %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
year album band
1980 The Wall Pink Floyd
1981 Hi Infidelity REO Speedwagon
1982 Asia Asia
1983 Thriller Michael Jackson
1984 Thriller Michael Jackson
1985 Born in the U.S.A. Bruce Springsteen
1986 Whitney Houston Whitney Houston
1987 Slippery When Wet Bon Jovi
1988 Faith George Michael
1989 Don’t Be Cruel Bobby Brown
both <- seventies %>% bind_rows(eighties)
## Warning in bind_rows_(x, .id): binding factor and character vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
both %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
year album band
1970 Bridge Over Troubled Water Simon and Garfunkel
1971 Jesus Christ Superstar Various Artists
1972 Harvest Neil Young
1973 The World is a Ghetto War
1974 Goodbye Yellow Brick Road Elton John
1975 Elton John’s Greatest Hits Elton John
1976 Peter Frampton Frampton Comes Alive
1977 Rumours Fleetwood Mac
1978 Saturday Night Fever Bee Gees
1979 Billy Joel 52nd Street
1980 The Wall Pink Floyd
1981 Hi Infidelity REO Speedwagon
1982 Asia Asia
1983 Thriller Michael Jackson
1984 Thriller Michael Jackson
1985 Born in the U.S.A. Bruce Springsteen
1986 Whitney Houston Whitney Houston
1987 Slippery When Wet Bon Jovi
1988 Faith George Michael
1989 Don’t Be Cruel Bobby Brown

Results

sixties contains the top selling albums in the US in the 1960s. It stores year as a numeric (double). When you combine it with seventies, which stores year as a factor, bind_rows() returns an error.

You can fix this by coercing seventies$year to a numeric. But if you do it like this, something surprising happens.

seventies %>% 
  mutate(year = as.numeric(year)) %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
year album band
1 Bridge Over Troubled Water Simon and Garfunkel
2 Jesus Christ Superstar Various Artists
3 Harvest Neil Young
4 The World is a Ghetto War
5 Goodbye Yellow Brick Road Elton John
6 Elton John’s Greatest Hits Elton John
7 Peter Frampton Frampton Comes Alive
8 Rumours Fleetwood Mac
9 Saturday Night Fever Bee Gees
10 Billy Joel 52nd Street

Can you fix things?

# sixties$year is a numeric and seventies$year is a factor
str(sixties)
## Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  3 variables:
##  $ year : num  1960 1961 1962 1963 1964 ...
##  $ album: chr  "The Sound of Music" "Camelot" "West Side Story" "West Side Story" ...
##  $ band : chr  "Original Broadway Cast" "Original Broadway Cast" "Soundtrack" "Soundtrack" ...
str(seventies)
## Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  3 variables:
##  $ year : Factor w/ 10 levels "1970","1971",..: 1 2 3 4 5 6 7 8 9 10
##  $ album: chr  "Bridge Over Troubled Water" "Jesus Christ Superstar" "Harvest" "The World is a Ghetto" ...
##  $ band : chr  "Simon and Garfunkel" "Various Artists" "Neil Young" "War" ...
# This gives an error about not converting factor to numeric (i.e Column `year` can't be converted from factor to numeric)
#seventies %>% bind_rows(sixties)

# So lets convert it....
# But this converts the year to the integer factor value, not the year. Whoops
seventies %>% 
  mutate(year = as.numeric(year))
## # A tibble: 10 x 3
##     year album                      band                
##    <dbl> <chr>                      <chr>               
##  1     1 Bridge Over Troubled Water Simon and Garfunkel 
##  2     2 Jesus Christ Superstar     Various Artists     
##  3     3 Harvest                    Neil Young          
##  4     4 The World is a Ghetto      War                 
##  5     5 Goodbye Yellow Brick Road  Elton John          
##  6     6 Elton John's Greatest Hits Elton John          
##  7     7 Peter Frampton             Frampton Comes Alive
##  8     8 Rumours                    Fleetwood Mac       
##  9     9 Saturday Night Fever       Bee Gees            
## 10    10 Billy Joel                 52nd Street
# We need to make it a character first to get the year, then convert to numeric
seventies %>% mutate(year = as.numeric(as.character(year)))
## # A tibble: 10 x 3
##     year album                      band                
##    <dbl> <chr>                      <chr>               
##  1  1970 Bridge Over Troubled Water Simon and Garfunkel 
##  2  1971 Jesus Christ Superstar     Various Artists     
##  3  1972 Harvest                    Neil Young          
##  4  1973 The World is a Ghetto      War                 
##  5  1974 Goodbye Yellow Brick Road  Elton John          
##  6  1975 Elton John's Greatest Hits Elton John          
##  7  1976 Peter Frampton             Frampton Comes Alive
##  8  1977 Rumours                    Fleetwood Mac       
##  9  1978 Saturday Night Fever       Bee Gees            
## 10  1979 Billy Joel                 52nd Street
# Now we can join the datasets correctly
seventies %>% 
  # Coerce seventies$year into a useful numeric
  mutate(year = as.numeric(as.character(year))) %>% 
  # Bind the updated version of seventies to sixties
  bind_rows(sixties) %>% 
  arrange(year) %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
year album band
1960 The Sound of Music Original Broadway Cast
1961 Camelot Original Broadway Cast
1962 West Side Story Soundtrack
1963 West Side Story Soundtrack
1964 Hello, Dolly! Original Broadway Cast
1965 Mary Poppins Soundtrack
1966 Whipped Cream & Other Delights Herb Alpert & The Tijuana Brass
1967 More of The Monkees The Monkees
1968 Are You Experienced? The Jimi Hendrix Experience
1969 In-A-Gadda-Da-Vida Iron Butterfly
1970 Bridge Over Troubled Water Simon and Garfunkel
1971 Jesus Christ Superstar Various Artists
1972 Harvest Neil Young
1973 The World is a Ghetto War
1974 Goodbye Yellow Brick Road Elton John
1975 Elton John’s Greatest Hits Elton John
1976 Peter Frampton Frampton Comes Alive
1977 Rumours Fleetwood Mac
1978 Saturday Night Fever Bee Gees
1979 Billy Joel 52nd Street

Session info

sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Switzerland.1252  LC_CTYPE=German_Switzerland.1252   
## [3] LC_MONETARY=German_Switzerland.1252 LC_NUMERIC=C                       
## [5] LC_TIME=German_Switzerland.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_3.1.0    dplyr_0.8.0.1    gapminder_0.3.0  kableExtra_1.0.1
## [5] knitr_1.21      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0        highr_0.7         plyr_1.8.4       
##  [4] pillar_1.3.1      compiler_3.5.2    prettydoc_0.2.1  
##  [7] tools_3.5.2       digest_0.6.18     gtable_0.2.0     
## [10] evaluate_0.12     tibble_2.0.1      viridisLite_0.3.0
## [13] pkgconfig_2.0.2   rlang_0.3.1       cli_1.0.1        
## [16] rstudioapi_0.9.0  yaml_2.2.0        xfun_0.4         
## [19] withr_2.1.2       httr_1.4.0        stringr_1.4.0    
## [22] xml2_1.2.0        hms_0.4.2         webshot_0.5.1    
## [25] grid_3.5.2        tidyselect_0.2.5  glue_1.3.0       
## [28] R6_2.4.0          fansi_0.4.0       rmarkdown_1.11   
## [31] readr_1.3.1       purrr_0.3.0       magrittr_1.5     
## [34] codetools_0.2-15  scales_1.0.0      htmltools_0.3.6  
## [37] assertthat_0.2.0  rvest_0.3.2       colorspace_1.4-0 
## [40] utf8_1.1.4        stringi_1.3.1     lazyeval_0.2.1   
## [43] munsell_0.5.0     crayon_1.3.4